Local Feature Matching

By the end of this exercise, you will be able to transform images of a flat (planar) object, or images taken from the same point into a common reference frame. This is at the core of applications such as panorama stitching.

A quick overview:

  1. We will start with histogram representations for images (or image regions).
  2. Then we will detect robust keypoints in images and use simple histogram descriptors to describe the neighborhood of each keypoint.
  3. After this we will compare descriptors from different images using a distance function and establish matching points.
  4. Using these matching points we will estimate the homography transformation between two images of a planar object (wall with graffiti) and use this to warp one image to look like the other.

Important: Follow the instructions below when submitting your attempt. Submissions not following these instructions will not be graded.

  1. Submit in teams of 3 or 4 students and add their names and matriculation numbers below. Only one team member should upload the solutions.
  2. Use Jupyter Notebook. Other notebook-editing software (e.g. JupyterLab, PyCharm) might corrupt the notebook files and could have issues with displaying matplotlib interactively.
  3. Do not remove, modify or duplicate any given cells, except those in which you need to fill in your implementation. You can add new cells in order to present additional texts or plots.
  4. Restart the kernel and re-run the whole notebook once before submission. After this step, the cell execution counts should be sequential from top to bottom, and all plots should be displayed.
  5. Submit only the .ipynb files, do not upload archives (zip, rar, tar, etc.), images or datasets.
  6. Do not change the filenames of the .ipynb files.

Team members (names and matriculation numbers):

Histograms in 1D

If we have a grayscale image, creating a histogram of the gray values tells us how frequently each gray value appears in the image, at a certain discretization level, which is controlled by the number of bins.

Implement compute_1d_histogram(im, n_bins). Given a grayscale image im with shape [height, width] and the number of bins n_bins, return a histogram array that contains the number of values falling into each bin. Assume that the image values are in the range [0,1), so the specified number of bins should cover the range from 0 to 1. Normalize the resulting histogram to sum to 1. What is the effect of different bin counts?
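A minimal sketch of one possible implementation (not the reference solution) could look like this, using np.bincount over the bin indices:

```python
import numpy as np

def compute_1d_histogram(im, n_bins):
    # Map each value in [0, 1) to a bin index in {0, ..., n_bins - 1}.
    bin_idx = (im * n_bins).astype(int).ravel()
    # Count the values per bin and normalize the histogram to sum to 1.
    hist = np.bincount(bin_idx, minlength=n_bins).astype(float)
    return hist / hist.sum()

# Tiny example: four pixels, two bins -> half the mass falls in each bin.
im = np.array([[0.1, 0.2], [0.6, 0.9]])
print(compute_1d_histogram(im, 2))  # [0.5 0.5]
```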

Histograms in 3D

If the pixel values are more than one-dimensional (e.g. three-dimensional RGB, for red, green and blue color channels), we can build a multi-dimensional histogram. In the R, G, B example this will tell us how frequently each combination of R, G, B values occurs. (Note that this contains more information than simply building 3 one-dimensional histograms, each for R, G and B, separately. Why?)

Implement a new function compute_3d_histogram(im, n_bins), which takes as input an array of shape [height, width, 3] and returns a histogram of shape [n_bins, n_bins, n_bins]. Again, assume that the range of values is [0,1) and normalize the histogram at the end.
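One way to sketch this (a hedged example, not necessarily how the notebook expects it) is to flatten the image to a list of RGB triples and let np.histogramdd do the joint binning:

```python
import numpy as np

def compute_3d_histogram(im, n_bins):
    # Flatten [height, width, 3] into a list of RGB triples.
    pixels = im.reshape(-1, 3)
    # Joint histogram over [0, 1)^3 with n_bins per axis.
    hist, _ = np.histogramdd(pixels, bins=n_bins, range=[(0, 1)] * 3)
    return hist / hist.sum()  # normalize to sum to 1
```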

Visualize the RGB histograms of the images sunset.png and terrain.png using the provided code and describe what you see. We cannot use a bar chart in 3D. Instead, in the position of each 3D bin ("voxel"), we have a sphere, whose volume is proportional to the histogram's value in that bin. The color of the sphere is simply the RGB color that the bin represents. Which number of bins gives the best impression of the color distribution?

Histograms in 2D

Now modify your code to work in 2D. This can be useful, for example, for a gradient image that stores two values for each pixel: the vertical and horizontal derivative. Again, assume the values are in the range [0,1).

Since gradients can be negative, we need to pick a relevant range of values and map it linearly to the range [0,1) before applying compute_2d_histogram. This is implemented by the function map_range provided at the beginning of the notebook.
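The provided map_range may look roughly like the following sketch (an assumption about its behavior, including the clipping detail, not its actual code):

```python
import numpy as np

def map_range(x, start, end):
    # Linearly map [start, end) to [0, 1).
    y = (x - start) / (end - start)
    # Clip so that out-of-range values still land inside the histogram range,
    # keeping the upper end strictly below 1.
    return np.clip(y, 0.0, 1.0 - 1e-9)

# Example: gradients in [-5, 5) are mapped so that 0 lands at 0.5.
print(map_range(np.array([-5.0, 0.0, 4.99]), -5, 5))
```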

In 2D we can plot the histogram as an image. For better visibility of small values, we plot the logarithm of each bin value. Yellowish colors mean high values. The center is (0,0). Can you explain why each histogram looks the way it does for the test images?

Similar to the function compute_gradient_histogram above, we can build a "Mag/Lap" histogram from the gradient magnitudes and the Laplacians at each pixel. Refer back to the first exercise to refresh your knowledge of the Laplacian. Implement this in compute_maglap_histogram!

Make sure to map the relevant range of the gradient magnitude and Laplacian values to [0,1) using map_range(). For the magnitude you can assume that the values will mostly lie in the range [0, 15) and the Laplacian in the range [-5, 5).

Comparing Histograms

The above histograms looked different, but to quantify this objectively, we need a distance measure. The Euclidean distance is a common one. Implement the function euclidean_distance, which takes two histograms $P$ and $Q$ as input and returns their Euclidean distance:

$$ \textit{dist}_{\textit{Euclidean}}(P, Q) = \sqrt{\sum_{i=1}^{D}{(P_i - Q_i)^2}} $$

Another commonly used distance for histograms is the so-called chi-squared ($\chi^2$) distance, commonly defined as:

$$ \chi^2(P, Q) = \frac{1}{2} \sum_{i=1}^{D}\frac{(P_i - Q_i)^2}{P_i + Q_i + \epsilon} $$

where a small value $\epsilon$ is used to avoid division by zero. Implement it as chi_square_distance. The inputs hist1 and hist2 are histogram vectors containing the bin values. Remember to use numpy array functions (such as np.sum()) instead of looping over each element in Python (looping is slow).
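Both distances are short one-liners with numpy; a hedged sketch following the two formulas above:

```python
import numpy as np

def euclidean_distance(hist1, hist2):
    # sqrt of the sum of squared bin differences.
    return np.sqrt(np.sum((hist1 - hist2) ** 2))

def chi_square_distance(hist1, hist2, eps=1e-10):
    # Chi-squared distance; eps guards against empty bins in both histograms.
    return 0.5 * np.sum((hist1 - hist2) ** 2 / (hist1 + hist2 + eps))
```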

Now let's take the image obj1__0.png as reference and let's compare it to obj91__0.png and obj94__0.png, using an RGB histogram, both with Euclidean and chi-square distance. Can you interpret the results?

You can also try other images from the "model" folder.

Keypoint Detection

Now we turn to finding keypoints in images.

Harris Detector

The Harris detector searches for points, around which the second-moment matrix $M$ of the gradient vector has two large eigenvalues (This $M$ is denoted by $C$ in the Grauman & Leibe script). This matrix $M$ can be written as:

$$ M(\sigma, \tilde{\sigma}) = G(\tilde{\sigma}) \star \left[\begin{matrix} I_x^2(\sigma) & I_x(\sigma) \cdot I_y(\sigma) \cr I_x(\sigma)\cdot I_y(\sigma) & I_y^2(\sigma) \end{matrix}\right] $$

Note that the matrix $M$ is computed for each pixel (we omitted the $x, y$ dependency in this formula for clarity). In the above notation the 4 elements of the second-moment matrix are considered as full 2D "images" (signals) and each of these 4 "images" is convolved with the Gaussian $G(\tilde{\sigma})$ independently. We have two sigmas $\sigma$ and $\tilde{\sigma}$ here for two different uses of Gaussian blurring: $\sigma$ is used for the Gaussian derivative filters that produce $I_x$ and $I_y$, while $\tilde{\sigma}$ is used to aggregate the derivative products over the local neighborhood.

Instead of explicitly computing the eigenvalues $\lambda_1$ and $\lambda_2$ of $M$, the following equivalences are used:

$$ \det(M) = \lambda_1 \lambda_2 = (G(\tilde{\sigma}) \star I_x^2)\cdot (G(\tilde{\sigma}) \star I_y^2) - (G(\tilde{\sigma}) \star (I_x\cdot I_y))^2 $$

$$ \mathrm{trace}(M) = \lambda_1 + \lambda_2 = G(\tilde{\sigma}) \star I_x^2 + G(\tilde{\sigma}) \star I_y^2 $$

The Harris criterion is then:

$$ \det(M) - \alpha \cdot \mathrm{trace}^2(M) > t $$

In practice, the parameters are usually set as $\tilde{\sigma} = 2 \sigma, \alpha=0.06$. Read more in Section 3.2.1.2 of the Grauman & Leibe script (grauman-leibe-ch3-local-features.pdf in the Moodle).


Write a function harris_score(im, opts) which computes the Harris criterion score for every pixel of im and returns the result as a score map.

To handle the large number of configurable parameters in this exercise, we will store them in an opts object. Use opts.sigma1 for $\sigma$, opts.sigma2 for $\tilde{\sigma}$ and opts.alpha for $\alpha$.
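A hedged sketch of the score computation, using plain keyword arguments instead of the opts object and scipy's Gaussian derivative filters (the notebook may use different filtering helpers):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_score(im, sigma1=2.0, alpha=0.06):
    sigma2 = 2 * sigma1  # tilde-sigma = 2 * sigma, as suggested above
    # Gaussian first derivatives I_x(sigma), I_y(sigma).
    Ix = gaussian_filter(im, sigma1, order=(0, 1))
    Iy = gaussian_filter(im, sigma1, order=(1, 0))
    # Smooth the derivative products with G(tilde-sigma) to get M's entries.
    Ixx = gaussian_filter(Ix * Ix, sigma2)
    Iyy = gaussian_filter(Iy * Iy, sigma2)
    Ixy = gaussian_filter(Ix * Iy, sigma2)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    # Harris criterion: det(M) - alpha * trace(M)^2.
    return det - alpha * trace ** 2
```

On a synthetic step-corner image, the score should be positive at the corner and negative along the edges, where only one eigenvalue is large.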

Furthermore, implement nms(scores) to perform non-maximum suppression of the response image.
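One common way to sketch non-maximum suppression (an assumed 3x3 variant; the required neighborhood size may differ) is to compare each pixel with the maximum of its neighborhood:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def nms(scores):
    # Keep a pixel only if it equals the maximum in its 3x3 neighborhood;
    # all other responses are set to zero. (Plateaus of equal values survive.)
    local_max = maximum_filter(scores, size=3)
    return np.where(scores == local_max, scores, 0.0)
```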

Then look at score_map_to_keypoints(scores, opts). It takes a score map and returns an array of shape [number_of_corners, 2], with each row being the $(x,y)$ coordinates of a found keypoint. We use opts.score_threshold as the threshold for considering a point to be a keypoint. (This is quite similar to how we found detections from score maps in the sliding-window detection exercise.)
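The provided function might be implemented along these lines (a sketch for intuition only, using a plain threshold argument instead of opts):

```python
import numpy as np

def score_map_to_keypoints(scores, score_threshold=1e-8):
    # Rows/columns of above-threshold responses; note the (x, y) ordering,
    # i.e. column index first.
    ys, xs = np.nonzero(scores > score_threshold)
    return np.stack([xs, ys], axis=1)  # shape [number_of_corners, 2]
```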

Now check the score maps and keypoints:

Hessian Detector

The Hessian detector operates on the second-derivative matrix $H$ (called the "Hessian" matrix):

$$ H = \left[\begin{matrix}I_{xx}(\sigma) & I_{xy}(\sigma) \cr I_{xy}(\sigma) & I_{yy}(\sigma)\end{matrix}\right] \tag{6} $$

Note that these are second derivatives, while the Harris detector computes products of first derivatives! The score is computed as follows:

$$ \sigma^4 \det(H) = \sigma^4 (I_{xx}I_{yy} - I^2_{xy}) > t \tag{7} $$

You can read more in Section 3.2.1.1 of the Grauman & Leibe script (grauman-leibe-ch3-local-features.pdf in the Moodle).


Write a function hessian_scores(im, opts), which computes the scale-normalized determinant-of-Hessian score for every pixel and returns the score map.

Use opts.sigma1 for computing the Gaussian second derivatives.
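A hedged sketch, again with a keyword argument in place of opts and scipy's Gaussian derivative filters as an assumed substitute for the notebook's helpers:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_scores(im, sigma1=2.0):
    # Gaussian second derivatives I_xx, I_yy and mixed derivative I_xy.
    Ixx = gaussian_filter(im, sigma1, order=(0, 2))
    Iyy = gaussian_filter(im, sigma1, order=(2, 0))
    Ixy = gaussian_filter(im, sigma1, order=(1, 1))
    # Scale-normalized determinant of the Hessian.
    return sigma1 ** 4 * (Ixx * Iyy - Ixy ** 2)
```

A blob-like bright spot should produce a strong positive response at its center, since there $I_{xx}$ and $I_{yy}$ are both large and of the same sign.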

Region Descriptor Matching

Now that we can detect robust keypoints, we can try to match them across different images of the same object. For this we need a way to compare the neighborhood of a keypoint found in one image with the neighborhood of a keypoint found in another. If the neighborhoods are similar, then the keypoints may represent the same physical point on the object.

To compare two neighborhoods, we compute a descriptor vector for the image window around each keypoint and then compare these descriptors using a distance function.

Inspect the following compute_rgb_descriptors function, which takes a window around each point in points, computes a 3D RGB histogram for each window, and returns these histograms as row vectors in a descriptors array.

Now write the function compute_maglap_descriptors, which works very similarly to compute_rgb_descriptors, but computes two-dimensional gradient-magnitude/Laplacian histograms. (Compute the gradient magnitude and the Laplacian for the full image first. See also the beginning of this exercise.) Pay attention to the scale of the gradient-magnitude values.

Now let's implement the distance computation between descriptors. Look at compute_euclidean_distances. It takes descriptors that were computed for keypoints found in two different images and returns the pairwise distances between all point pairs.

Implement compute_chi_square_distances in a similar manner.

Given the distances, a simple way to produce point matches is to take each descriptor extracted from a keypoint of the first image, and find the keypoint in the second image with the nearest descriptor. The full pipeline from images to point matches is implemented below in the function find_point_matches(im1, im2, opts).
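The nearest-neighbor matching step described here can be sketched as follows (find_nearest_matches is a hypothetical helper name; the actual logic lives inside the provided find_point_matches):

```python
import numpy as np

def find_nearest_matches(distances):
    # distances[i, j] is the descriptor distance between keypoint i of
    # image 1 and keypoint j of image 2.
    idx2 = np.argmin(distances, axis=1)       # best match in image 2 per row
    idx1 = np.arange(distances.shape[0])
    # Sort the pairs by distance so the most confident matches come first.
    order = np.argsort(distances[idx1, idx2])
    return np.stack([idx1[order], idx2[order]], axis=1)
```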

Experiment with different parameter settings. Which keypoint detector, region descriptor and distance function works best?

Homography Estimation

Now that we have these pairs of matching points (also called point correspondences), what can we do with them? In the above case, the wall is planar (flat) and the camera was moved towards the left to take the second image compared to the first image. Therefore, the way that points on the wall are transformed across these two images can be modeled as a homography. Homographies can model two distinct effects:

  1. the transformation between images taken from the same camera center (pure camera rotation), and
  2. the transformation of points on a planar surface between two arbitrary camera views.

We are dealing with the second case in these graffiti images. Therefore if our point matches are correct, there should be a homography that transforms image points in the first image to the corresponding points in the second image. Recap the algorithm from the lecture for finding this homography (it's called the Direct Linear Transformation, DLT). There is a 2 page description of it in the Grauman & Leibe script (grauman-leibe-ch5-geometric-verification.pdf in the Moodle) in Section 5.1.3.


Now let's actually put this into practice. Implement estimate_homography(point_matches), which returns a 3x3 homography matrix that transforms points of the first image to points of the second image. The steps are:

  1. Build the matrix $A$ from the point matches according to Eq. 5.7 from the script.
  2. Apply SVD using np.linalg.svd(A). It returns $U,d,V^T$. Note that the last return value is not $V$ but $V^T$.
  3. Compute $\mathbf{h}$ from $V$ according to Eq. 5.9 or 5.10
  4. Reshape $\mathbf{h}$ to the 3x3 matrix $H$ and return it.

The input point_matches contains as many rows as there are point matches (correspondences) and each row has 4 elements: $x, y, x', y'$.
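The four steps above can be sketched like this (a hedged DLT implementation; the exact row layout of $A$ should be checked against Eq. 5.7 in the script):

```python
import numpy as np

def estimate_homography(point_matches):
    # point_matches: [n, 4] array with rows (x, y, x', y').
    rows = []
    for x, y, xp, yp in point_matches:
        # Two equations per correspondence, from x'(h3.p) = h1.p and
        # y'(h3.p) = h2.p with p = (x, y, 1).
        rows.append([-x, -y, -1, 0, 0, 0, x * xp, y * xp, xp])
        rows.append([0, 0, 0, -x, -y, -1, x * yp, y * yp, yp])
    A = np.array(rows)
    # h is the right singular vector for the smallest singular value,
    # i.e. the last row of V^T.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```

For a sanity check, correspondences generated by a pure translation should recover a homography with identity rotation/scale and the translation in the last column (up to scale).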

The point_matches have already been sorted in the find_point_matches function according to the descriptor distances, so the more accurate pairs will be near the beginning. We can use the top $k$, e.g. $k=10$ pairs in the homography estimation and get a reasonably accurate estimate. Which $k$ gives the best result? What happens if you use too many? Why?

We can use cv2.warpPerspective to warp the first image to the reference frame of the second. Does the result look good?

Can you interpret the entries of the resulting $H$ matrix and are the numbers as you would expect them for these images?

You can also try other images from the graff5 folder or the NewYork folder.